From Python for Data Analysis:
Time series data is an important form of structured data in many different fields, such as finance, economics, ecology, neuroscience, and physics. Anything that is observed or measured at many points in time forms a time series. Many time series are fixed frequency, which is to say that data points occur at regular intervals according to some rule, such as every 15 seconds, every 5 minutes, or once per month. Time series can also be irregular, without a fixed unit of time or offset between units. How you mark and refer to time series data depends on the application, and you may have one of the following:
- timestamps, specific instants in time
- fixed periods, such as the month January 2007 or the full year 2010
- intervals of time, indicated by a start and end timestamp; periods can be thought of as special cases of intervals
- experiment or elapsed time; each timestamp is a measure of time relative to a particular start time, for example, the diameter of a cookie baking each second since being placed in the oven
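Each of these kinds of time data maps onto a pandas class; the sketch below assumes a reasonably recent pandas, and the specific dates are just illustrations:

```python
import pandas as pd

# A timestamp: a specific instant in time
stamp = pd.Timestamp('2011-01-10 09:30')

# A fixed period: the whole month of January 2007
period = pd.Period('2007-01', freq='M')

# An interval of time with an explicit start and end timestamp
span = pd.Interval(pd.Timestamp('2011-01-01'), pd.Timestamp('2011-01-31'))

# A period behaves like an interval anchored to a calendar rule
print(period.start_time, period.end_time)
```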
Pandas provides a standard set of time series tools and data algorithms. With these you can efficiently work with very large time series and easily slice and dice, aggregate, and resample irregular and fixed-frequency time series. As you might guess, many of these tools are especially useful for finance and economics applications, but you could certainly use them to analyze server log data, too.
In [ ]:
from __future__ import division
from pandas import Series, DataFrame
import pandas as pd
from numpy.random import randn
import numpy as np
pd.options.display.max_rows = 12
np.set_printoptions(precision=4, suppress=True)
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(12, 4))
In [ ]:
%matplotlib inline
In general, dealing with date arithmetic is hard. Luckily, Python's standard library includes the datetime module, whose datetime objects handle all of the annoying bits of date manipulation in a powerful way.
In [ ]:
from datetime import datetime
now = datetime.now()
now
Every datetime object has year, month, and day fields.
In [ ]:
now.year, now.month, now.day
You can do arithmetic on datetime objects, which produces timedelta objects.
In [ ]:
delta = datetime(2011, 1, 7) - datetime(2008, 6, 24, 8, 15)
delta
timedelta objects represent the difference between two datetime objects and expose it through fields such as days and seconds:
In [ ]:
delta.days
In [ ]:
delta.seconds
As you would expect, arithmetic between datetime and timedelta objects produces datetime objects.
In [ ]:
from datetime import timedelta
start = datetime(2011, 1, 7)
start + timedelta(12)
In [ ]:
start - 2 * timedelta(12)
In general, it is easier to format a string from a datetime object than to parse a string date into a datetime object.
In [ ]:
stamp = datetime(2011, 1, 3)
In [ ]:
str(stamp)
To format a string from a datetime object, use the strftime method with the standard format codes, such as %Y for the four-digit year, %m for the zero-padded month, and %d for the zero-padded day.
In [ ]:
stamp.strftime('%Y-%m-%d')
To parse a string into a datetime object, you can use the strptime method, along with the relevant format.
In [ ]:
value = '2011-01-03'
datetime.strptime(value, '%Y-%m-%d')
Of course, this being Python, we can easily abstract this process to list form using comprehensions.
In [ ]:
datestrs = ['7/6/2011', '8/6/2011']
[datetime.strptime(x, '%m/%d/%Y') for x in datestrs]
Without question, datetime.strptime is the best way to parse a date, especially when you know the format a priori. However, it can be a bit annoying to have to write a format spec each time, especially for common date formats. In this case, you can use the parser.parse method in the third-party dateutil package:
In [ ]:
from dateutil.parser import parse
parse('2011-01-03')
dateutil is capable of parsing almost any human-intelligible date representation:
In [ ]:
parse('Jan 31, 1997 10:45 PM')
In international locales, day appearing before month is very common, so you can pass dayfirst=True to indicate this:
In [ ]:
parse('6/12/2011', dayfirst=True)
Pandas is generally oriented toward working with arrays of dates, whether used as an index or a column in a DataFrame. The to_datetime method parses many different kinds of date representations. Standard date formats like ISO 8601 can be parsed very quickly.
In [ ]:
datestrs
In [ ]:
pd.to_datetime(datestrs)
Notice that the Pandas object at work behind the scenes here is the DatetimeIndex, which is a subclass of Index. More on this later. to_datetime also handles values that should be considered missing (None, empty string, etc.):
In [ ]:
idx = pd.to_datetime(datestrs + [None])
idx
In [ ]:
idx[2]
In [ ]:
pd.isnull(idx)
datetime objects also have a number of locale-specific formatting options for systems in other countries or languages. For example, the abbreviated month names will be different on German or French systems compared with English systems.
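A quick sketch of this using the standard locale module; the German locale name is an assumption about what is installed on the system, so failure to switch is tolerated:

```python
import locale
from datetime import datetime

stamp = datetime(2011, 1, 3)

# Python's strftime uses the C/English locale unless one has been set
english = stamp.strftime('%B')
print(english)

# Switching LC_TIME changes month and day names; this assumes the
# de_DE.UTF-8 locale is available, so errors are caught and ignored
try:
    locale.setlocale(locale.LC_TIME, 'de_DE.UTF-8')
    print(stamp.strftime('%B'))  # 'Januar' on German systems
except locale.Error:
    pass
finally:
    locale.setlocale(locale.LC_TIME, 'C')
```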
The most basic kind of time series object in Pandas is a Series indexed by timestamps, which is often represented external to Pandas as Python strings or datetime objects.
In [ ]:
from datetime import datetime
dates = [datetime(2011, 1, 2), datetime(2011, 1, 5), datetime(2011, 1, 7),
datetime(2011, 1, 8), datetime(2011, 1, 10), datetime(2011, 1, 12)]
ts = Series(np.random.randn(6), index=dates)
ts
Under the hood, these datetime objects have been put in a DatetimeIndex, and the variable ts is a Series (older versions of Pandas used a dedicated TimeSeries subclass, which has since been removed).
In [ ]:
type(ts)
# note: output changed to "pandas.core.series.Series"
In [ ]:
ts.index
Like other Series, arithmetic operations between differently-indexed time series automatically align on the dates:
In [ ]:
ts + ts[::2]
Pandas stores timestamps using NumPy's datetime64 data type at nanosecond resolution:
In [ ]:
ts.index.dtype
# note: output changed from dtype('datetime64[ns]') to dtype('<M8[ns]')
Scalar values from a DatetimeIndex are Pandas Timestamp objects:
In [ ]:
stamp = ts.index[0]
stamp
# note: output changed from <Timestamp: 2011-01-02 00:00:00> to Timestamp('2011-01-02 00:00:00')
A Timestamp can be substituted anywhere you would use a datetime object. Additionally, it can store frequency information (if any) and understands how to do time zone conversions and other kinds of manipulations. More on both of these things later.
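Part of why the substitution works is that Timestamp is in fact a subclass of datetime, so isinstance checks and plain datetime arithmetic accept it unchanged (a quick check, assuming pandas is importable):

```python
import pandas as pd
from datetime import datetime

stamp = pd.Timestamp('2011-01-02')

# Timestamp instances pass isinstance checks for datetime...
print(isinstance(stamp, datetime))

# ...and can be mixed freely with plain datetime objects in arithmetic
delta = stamp - datetime(2011, 1, 1)
print(delta.days)
```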
A time series behaves like any other Series with regard to indexing and selecting data based on label:
In [ ]:
stamp = ts.index[2]
ts[stamp]
As a convenience, you can also pass a string that is interpretable as a date:
In [ ]:
ts['1/10/2011']
In [ ]:
ts['20110110']
For longer time series, a year or only a year and month can be passed to easily select slices of data:
In [ ]:
longer_ts = Series(np.random.randn(1000),
index=pd.date_range('1/1/2000', periods=1000))
longer_ts
In [ ]:
longer_ts['2001']
In [ ]:
longer_ts['2001-05']
Slicing with dates works just like with a regular Series:
In [ ]:
ts[datetime(2011, 1, 7):]
Because most time series data is ordered chronologically, you can slice with timestamps not contained in a time series to perform a range query:
In [ ]:
ts
In [ ]:
ts['1/6/2011':'1/11/2011']
As before, you can pass either a string date, datetime, or Timestamp. Remember that slicing in this manner produces views on the source time series, just like slicing NumPy arrays. There is an equivalent instance method, truncate, which slices a time series between two dates:
In [ ]:
ts.truncate(after='1/9/2011')
All of the above holds true for DataFrame as well, indexing on its rows:
In [ ]:
dates = pd.date_range('1/1/2000', periods=100, freq='W-WED')
long_df = DataFrame(np.random.randn(100, 4),
index=dates,
columns=['Colorado', 'Texas', 'New York', 'Ohio'])
long_df.loc['2001-05']
# note: the .ix indexer has been removed; use .loc with a 'YYYY-MM' string instead
In some applications, there may be multiple data observations falling on a particular timestamp. Here is an example:
In [ ]:
dates = pd.DatetimeIndex(['1/1/2000', '1/2/2000', '1/2/2000', '1/2/2000',
'1/3/2000'])
dup_ts = Series(np.arange(5), index=dates)
dup_ts
We can tell that the index is not unique by checking its is_unique property:
In [ ]:
dup_ts.index.is_unique
Indexing into this time series will now either produce scalar values or slices depending on whether a timestamp is duplicated:
In [ ]:
dup_ts['1/3/2000'] # not duplicated
In [ ]:
dup_ts['1/2/2000'] # duplicated
Suppose you want to aggregate the data having non-unique timestamps. One way to do this is to use groupby and pass level=0 (the only level of indexing!):
In [ ]:
grouped = dup_ts.groupby(level=0)
grouped.mean()
In [ ]:
grouped.count()
Generic time series in Pandas are assumed to be irregular; that is, they have no fixed frequency. For many applications this is sufficient. However, it's often desirable to work relative to a fixed frequency, such as daily, monthly, or every 15 minutes, even if that means introducing missing values into a time series. Fortunately, Pandas has a full suite of standard time series frequencies and tools for resampling, inferring frequencies, and generating fixed-frequency date ranges. For example, the example time series can be converted to fixed daily frequency by calling resample:
In [ ]:
ts
In [ ]:
ts.resample('D').asfreq()
# note: in modern pandas, resample returns a Resampler object; chain .asfreq() to materialize the result
Conversion between frequencies or resampling is a big enough topic to have its own section later. Here, we'll see how to use the base frequencies and multiples thereof.
You may have guessed that pandas.date_range is responsible for generating a DatetimeIndex with an indicated length according to a particular frequency:
In [ ]:
index = pd.date_range('4/1/2012', '6/1/2012')
index
By default, date_range generates daily timestamps. If you pass only a start or end date, you must pass a number of periods to generate:
In [ ]:
pd.date_range(start='4/1/2012', periods=20)
In [ ]:
pd.date_range(end='6/1/2012', periods=20)
The start and end dates define strict boundaries for the generated date index. For example, if you wanted a date index containing the last business day of each month, you would pass the 'BM' frequency (business month end), and only dates falling on or inside the date interval will be included:
In [ ]:
pd.date_range('1/1/2000', '12/1/2000', freq='BM')
date_range by default preserves the time (if any) of the start or end timestamp:
In [ ]:
pd.date_range('5/2/2012 12:56:31', periods=5)
Sometimes you will have start or end dates with time information but want to generate a set of timestamps normalized to midnight as a convention. To do this, there is a normalize option:
In [ ]:
pd.date_range('5/2/2012 12:56:31', periods=5, normalize=True)
Frequencies in Pandas are composed of a base frequency and a multiplier. Base frequencies are typically referred to by a string alias, like 'M' for monthly or 'H' for hourly. For each base frequency, there is an object defined, generally referred to as a date offset. For example, hourly frequency can be represented with the Hour class:
In [ ]:
from pandas.tseries.offsets import Hour, Minute
hour = Hour()
hour
You can define a multiple of an offset by passing an integer:
In [ ]:
four_hours = Hour(4)
four_hours
In most applications, you would never need to explicitly create one of these objects, instead using a string alias like 'H' or '4H'. Putting an integer before the base frequency creates a multiple:
In [ ]:
pd.date_range('1/1/2000', '1/3/2000 23:59', freq='4h')
Many offsets can be combined together by addition:
In [ ]:
Hour(2) + Minute(30)
Similarly, you can pass frequency strings like '2h30min', which will effectively be parsed to the same expression.
In [ ]:
pd.date_range('1/1/2000', periods=10, freq='1h30min')
Some frequencies describe points in time that are not evenly spaced. For example, 'M' (calendar month end) and 'BM' (last business/weekday of month) depend on the number of days in a month and, in the latter case, whether the month ends on a weekend or not. For lack of a better term, we will call these anchored offsets.
One useful frequency class is "week of month", whose aliases start with WOM. This enables you to get dates like the third Friday of each month:
In [ ]:
rng = pd.date_range('1/1/2012', '9/1/2012', freq='WOM-3FRI')
list(rng)
Traders of US equity options will recognize these dates as the standard dates of monthly expiry.
"Shifting" refers to moving data backward and forward through time. Both Series and DataFrame have a shift method for doing naive shifts forward or backward, leaving the index unmodified:
In [ ]:
ts = Series(np.random.randn(4),
index=pd.date_range('1/1/2000', periods=4, freq='M'))
ts
In [ ]:
ts.shift(2)
In [ ]:
ts.shift(-2)
A common use of shift is computing percent changes in a time series or multiple time series as DataFrame columns. This is expressed as ts / ts.shift(1) - 1.
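The shift-based percent change, ts / ts.shift(1) - 1, matches pandas' built-in pct_change method, as this small sketch shows:

```python
import numpy as np
import pandas as pd

ts = pd.Series(np.random.randn(4),
               index=pd.date_range('1/1/2000', periods=4, freq='D'))

# Percent change via a naive shift: each value relative to the prior one;
# the first entry has no predecessor, so it comes out NaN
manual = ts / ts.shift(1) - 1

# pandas ships the same computation as a method
builtin = ts.pct_change()
```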
Because naive shifts leave the index unmodified, some data is discarded. Thus, if the frequency is known, it can be passed to shift to advance the timestamps instead of simply the data:
In [ ]:
ts.shift(2, freq='M')
Other frequencies can be passed, too, giving you a lot of flexibility in how to lead and lag the data:
In [ ]:
ts.shift(3, freq='D')
In [ ]:
ts.shift(1, freq='3D')
In [ ]:
ts.shift(1, freq='90T')
The Pandas date offsets can also be used with datetime or Timestamp objects:
In [ ]:
from pandas.tseries.offsets import Day, MonthEnd
now = datetime(2011, 11, 17)
now + 3 * Day()
If you add an anchored offset like MonthEnd, the first increment will "roll forward" a date to the next date according to the frequency rule:
In [ ]:
now + MonthEnd()
In [ ]:
now + MonthEnd(2)
Anchored offsets can explicitly "roll" dates forward or backward using their rollforward and rollback methods, respectively:
In [ ]:
offset = MonthEnd()
offset.rollforward(now)
In [ ]:
offset.rollback(now)
A clever use of date offsets is to use these methods with groupby:
In [ ]:
ts = Series(np.random.randn(20),
index=pd.date_range('1/15/2000', periods=20, freq='4d'))
ts.groupby(offset.rollforward).mean()
Of course, an easier and faster way to do this is using resample (more on this to come).
In [ ]:
ts.resample('M').mean()
# note: the how= keyword has been removed; call an aggregation method on the Resampler instead
Working with time zones is a pain. As Americans hold on dearly to daylight saving time, we must pay the price with difficult conversions between time zones. Many time series users choose to work with time series in coordinated universal time (UTC), from which time zones can be expressed as offsets.
In Python, we can use the pytz library, based on the Olson database of world time zone data.
In [ ]:
import pytz
pytz.common_timezones[-5:]
To get a time zone object from pytz, use pytz.timezone.
In [ ]:
tz = pytz.timezone('US/Eastern')
tz
Methods in Pandas will accept either time zone names or these objects. Using the names is recommended.
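Either form yields the same localized result; a small check, assuming pytz is installed alongside pandas:

```python
import pandas as pd
import pytz

# Passing the time zone by name...
by_name = pd.date_range('3/9/2012 9:30', periods=3, freq='D',
                        tz='US/Eastern')

# ...or as a pytz object localizes to the same instants
by_object = pd.date_range('3/9/2012 9:30', periods=3, freq='D',
                          tz=pytz.timezone('US/Eastern'))
```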
By default, time series in Pandas are time zone naive. Consider the following time series:
In [ ]:
rng = pd.date_range('3/9/2012 9:30', periods=6, freq='D')
ts = Series(np.random.randn(len(rng)), index=rng)
The index's tz field is None:
In [ ]:
print(ts.index.tz)
Date ranges can be generated with a time zone set:
In [ ]:
pd.date_range('3/9/2012 9:30', periods=10, freq='D', tz='UTC')
Conversion from naive to localized is handled by the tz_localize method:
In [ ]:
ts_utc = ts.tz_localize('UTC')
ts_utc
In [ ]:
ts_utc.index
Once a time series has been localized to a particular time zone, it can be converted to another time zone using tz_convert.
In [ ]:
ts_utc.tz_convert('US/Eastern')
In the case of the above time series, which straddles a DST transition in the US/Eastern time zone, we can localize to US/Eastern and convert to, say, UTC or Berlin time.
In [ ]:
ts_eastern = ts.tz_localize('US/Eastern')
ts_eastern.tz_convert('UTC')
In [ ]:
ts_eastern.tz_convert('Europe/Berlin')
tz_localize and tz_convert are also instance methods on DatetimeIndex.
In [ ]:
ts.index.tz_localize('Asia/Shanghai')
Analogous to time series and date ranges, individual Timestamp objects can be localized from naive to time zone-aware and converted from one time zone to another:
In [ ]:
stamp = pd.Timestamp('2011-03-12 04:00')
stamp_utc = stamp.tz_localize('utc')
stamp_utc.tz_convert('US/Eastern')
You can also pass a time zone when creating the Timestamp.
In [ ]:
stamp_moscow = pd.Timestamp('2011-03-12 04:00', tz='Europe/Moscow')
stamp_moscow
Time zone-aware Timestamp objects internally store a UTC timestamp value as nanoseconds since the UNIX epoch (January 1, 1970); this UTC value is invariant between time zone conversions:
In [ ]:
stamp_utc.value
In [ ]:
stamp_utc.tz_convert('US/Eastern').value
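The invariance is easy to verify directly:

```python
import pandas as pd

stamp = pd.Timestamp('2011-03-12 04:00')
stamp_utc = stamp.tz_localize('utc')

# Converting between zones changes only the wall-clock display;
# the underlying nanosecond UTC value stays the same
eastern = stamp_utc.tz_convert('US/Eastern')
moscow = stamp_utc.tz_convert('Europe/Moscow')
print(stamp_utc.value == eastern.value == moscow.value)
```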
When performing time arithmetic using Pandas' DateOffset objects, daylight saving time transitions are respected where possible:
In [ ]:
# 30 minutes before DST transition
from pandas.tseries.offsets import Hour
stamp = pd.Timestamp('2012-03-12 01:30', tz='US/Eastern')
stamp
In [ ]:
stamp + Hour()
In [ ]:
# 90 minutes before DST transition
stamp = pd.Timestamp('2012-11-04 00:30', tz='US/Eastern')
stamp
In [ ]:
stamp + 2 * Hour()
If two time series with different time zones are combined, the result will be UTC. Since the timestamps are stored under the hood in UTC, this is a straightforward operation and requires no conversion.
In [ ]:
rng = pd.date_range('3/7/2012 9:30', periods=10, freq='B')
ts = Series(np.random.randn(len(rng)), index=rng)
ts
In [ ]:
ts1 = ts[:7].tz_localize('Europe/London')
ts2 = ts1[2:].tz_convert('Europe/Moscow')
result = ts1 + ts2
result.index
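Rerunning that combination as a self-contained check confirms the combined index comes out in UTC:

```python
import numpy as np
import pandas as pd

rng = pd.date_range('3/7/2012 9:30', periods=10, freq='B')
ts = pd.Series(np.random.randn(len(rng)), index=rng)

ts1 = ts[:7].tz_localize('Europe/London')
ts2 = ts1[2:].tz_convert('Europe/Moscow')

# Adding differently-zoned series aligns on the underlying UTC instants,
# and the resulting index is reported in UTC
result = ts1 + ts2
print(result.index.tz)
```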